
extend nonzero to int64 #125850

Open · wants to merge 16 commits into main

Conversation

bhack
Contributor

@bhack bhack commented May 9, 2024

Fixes #51871
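
For context, a minimal sketch of the failure this PR targets (issue #51871): torch.nonzero on a CUDA tensor with more than INT_MAX elements. The tensor shape and the behaviour described in the comments are illustrative assumptions, not taken from this PR.

import torch

INT_MAX = 2**31 - 1

# A boolean CUDA tensor just past the 32-bit element limit (~2 GiB of storage).
x = torch.zeros(INT_MAX + 1, dtype=torch.bool, device="cuda")
x[-1] = True

# Without the fix, the CUDA nonzero path is limited to 32-bit element counts,
# so a call like this raised a RuntimeError instead of returning the index.
idx = torch.nonzero(x)
print(idx)  # expected once fixed: tensor([[2147483647]], device='cuda:0')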

@pytorch-bot pytorch-bot bot added the release notes: cuda label May 9, 2024

pytorch-bot bot commented May 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125850

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 3 Unrelated Failures

As of commit a46722a with merge base 8f30f36:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@bhack
Contributor Author

bhack commented May 9, 2024

/cc @ezyang @eqy This is an exploratory, black-box PR: I don't have free CUDA resources right now, and we don't have a quick way to set up the environment for contributing sparse C++/CUDA PRs (see #125297).

But I made this editable on your side, in case you have the environment ready and a quick fix is enough.

@bhack bhack marked this pull request as ready for review May 9, 2024 16:52
@bhack bhack requested a review from eqy as a code owner May 9, 2024 16:52

linux-foundation-easycla bot commented May 10, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

Collaborator

@eqy eqy left a comment


Should we add a (presumably large tensor) test for this?

@bhack
Contributor Author

bhack commented May 10, 2024

Should we add a (presumably large tensor) test for this?

Do we already have an INT_MAX test somewhere that we could expand?

@eqy
Collaborator

eqy commented May 10, 2024

Unfortunately these are not really unified at the moment, but this should surface some examples: https://github.com/search?q=repo%3Apytorch%2Fpytorch+64bit+language%3APython+path%3A%2F%5Etest%5C%2F%2F&type=code
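
As a rough sketch of what such a big-tensor test could look like, modelled on the 64-bit test pattern the search link above surfaces (the decorator usage and the memory annotation are assumptions here, not part of this PR):

import torch
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests, largeTensorTest, onlyCUDA)
from torch.testing._internal.common_utils import TestCase, run_tests


class TestNonzeroLarge(TestCase):
    @onlyCUDA
    @largeTensorTest("32GB", device="cuda")  # memory budget is a rough guess
    def test_nonzero_past_int_max(self, device):
        numel = 2**31 + 2  # just past INT_MAX
        x = torch.zeros(numel, dtype=torch.bool, device=device)
        x[0] = True
        x[-1] = True
        idx = torch.nonzero(x).flatten()
        self.assertEqual(idx.tolist(), [0, numel - 1])


instantiate_device_type_tests(TestNonzeroLarge, globals(), only_for="cuda")

if __name__ == "__main__":
    run_tests()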

@bhack
Contributor Author

bhack commented May 10, 2024

As we don't have a specific CUDA test, do we want to find a workaround from Python?

Can you suggest one from grep -R torch.nonzero test/?

@bhack
Contributor Author

bhack commented May 10, 2024

I think I am going to close this, as cub::DispatchSelectIf will probably be slower than the cub::DeviceSelect::Flagged path we are currently using.

We probably need to wait for the upstream fix in NVIDIA/cccl#1422.

What do you think?
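
For readers unfamiliar with the two CUB entry points being compared, a plain-Python picture of the difference is below; this is intuition only, not how the CUDA kernel is written.

# DeviceSelect::Flagged-style: selection driven by a precomputed flag per item.
def select_flagged(indices, flags):
    return [i for i, f in zip(indices, flags) if f]

# DispatchSelectIf-style: a predicate is evaluated on each item on the fly.
def select_if(indices, values, predicate):
    return [i for i, v in zip(indices, values) if predicate(v)]

values = [0.0, 3.0, 0.0, -1.5]
indices = list(range(len(values)))
flags = [v != 0 for v in values]  # what nonzero's flag iterator would yield
assert select_flagged(indices, flags) == [1, 3]
assert select_if(indices, values, lambda v: v != 0) == [1, 3]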

@bhack
Contributor Author

bhack commented May 11, 2024

@ezyang Do you think we can open a new ticket to lower this with Triton where and sum?
https://github.com/pytorch/pytorch/blob/a174c536f8f32b41da9efa647e364196423468a5/torch/_inductor/lowering.py#L2187C20-L2187C35

Edit:
The ticket is at #126003
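
One way to picture the decomposition behind that ticket is the sketch below, written in plain PyTorch ops rather than the actual Inductor lowering (the function name is made up here): the output length comes from a sum over the mask, the output slots from a cumulative sum, and the data-dependent output size is what makes nonzero hard to compile.

import torch

def nonzero_1d_sketch(x: torch.Tensor) -> torch.Tensor:
    mask = x != 0
    count = int(mask.sum())  # data-dependent output length ("sum")
    slots = torch.cumsum(mask.to(torch.int64), dim=0) - 1  # output slot per hit
    out = torch.empty(count, dtype=torch.int64, device=x.device)
    idx = torch.arange(x.numel(), device=x.device)
    out[slots[mask]] = idx[mask]  # scatter each hit's index into its slot
    return out

x = torch.tensor([0, 5, 0, 0, 2, 7])
assert torch.equal(nonzero_1d_sketch(x), torch.nonzero(x).flatten())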

using flag_iterator_t = cub::NullType*;
using equality_op_t = cub::NullType;

return cub::DispatchSelectIf<
Contributor Author

@bhack bhack May 11, 2024


Does this require cub/cccl 2.4.0?

@ezyang
Contributor

ezyang commented May 11, 2024

Yes, we need a big tensor test. @eqy's link is good for examples.

@bhack
Contributor Author

bhack commented May 11, 2024

Ok, thanks.
So I am going to close it, as I don't have the environment, or spare GPU compute right now, to write a brand-new test and recompile.
That is, unless we identify another Python test that already uses nonzero indirectly and that we could modify with a big input.

@bhack
Contributor Author

bhack commented May 11, 2024

I just wanted to check whether it could at least compile with the current CUB version.
Do you know what this CI failure is?

/usr/local/cuda/include/cub/agent/agent_select_if.cuh(264): error: function "at::native::<unnamed>::NonZeroOp<T>::operator() [with T=c10::complex<c10::Half>]" cannot be called with the given argument list
            argument types are: (int64_t)
            object type is: at::native::<unnamed>::NonZeroOp<c10::complex<c10::Half>>
                  selection_flags[ITEM] = select_op(items[ITEM]);

@bhack
Contributor Author

bhack commented May 13, 2024

I think we need the cub/cub/agent/agent_select_if.cuh changes introduced in NVIDIA/cccl#1379.

So this means we need to wait for the next CUDA 12.4 update and also make the change conditional.

@ezyang
Contributor

ezyang commented May 14, 2024

This PR seems fine. I agree you may need to preprocessor your way to victory. CI will say.

@ezyang ezyang added the triaged label May 14, 2024
@ezyang ezyang self-requested a review May 14, 2024 18:02
@bhack
Contributor Author

bhack commented May 23, 2024

@ezyang the new CUDA 12.5 delivers CUB 2.4.0, so it could be enough for this workaround.

@ezyang
Contributor

ezyang commented May 28, 2024

if you're willing to wait for cuda 12.5 :)

@bhack
Contributor Author

bhack commented May 28, 2024

if you're willing to wait for cuda 12.5 :)

This version is required for the workaround API. A full upstream solution will require waiting for more CUDA releases.

@ezyang
Contributor

ezyang commented May 29, 2024

Sorry, I'm not sure I understand the state of this PR. I would happily accept a PR that makes nonzero int64 work on CUDA 12.5 or later, subject to the requirement that this functionality is preprocessored out. If you don't mind waiting, we can also ice this PR until CUDA 12.6 shows up by default and then we can just land it as is.

@bhack
Contributor Author

bhack commented May 30, 2024

Testing/using the workaround in this PR currently requires CUB 2.4.0.
As PyTorch currently uses CUB from the official CUDA distribution, that means we need to upgrade PyTorch to support CUDA 12.5 (any plan?).

For a "regular" upstream solution we need this to be merged:
NVIDIA/cccl#1422

After that we need an official CUB release, and that release needs to be included in a CUDA release.
So we have no guarantee that we will have an upstream solution with CUDA 12.6.

@bhack bhack mentioned this pull request May 30, 2024
@bhack
Contributor Author

bhack commented May 30, 2024

@ezyang
Contributor

ezyang commented May 31, 2024

@atalman for CUDA 12.5

Labels
open source · release notes: cuda · triaged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tensor.nonzero fails on GPU for tensors containing more than INT_MAX elements
4 participants